Abstract

Automatic text categorization is a complex and useful task for many
natural language processing applications. Recent approaches to text
categorization focus more on algorithms than on the resources involved in
this operation. In contrast to this trend, we present an approach based on
the integration of widely available resources, such as lexical databases
and training collections, to overcome current limitations of the task. Our
approach makes use of WordNet synonymy information to increase evidence
for poorly trained categories. When testing a direct categorization, a
WordNet-based one, a training algorithm, and our integrated approach, the
latter exhibits better performance than any of the others. Incidentally,
the performance of the WordNet-based approach is comparable with that of
the training approach.



1 Introduction 

Text categorization (TC) is the classification of documents with respect to a set 
of one or more pre-existing categories. TC is a hard and very useful operation 
frequently applied to the assignment of subject categories to documents, to route 
and filter texts, or as a part of natural language processing systems. 

In this paper we present an automatic TC approach based on the use of
several linguistic resources. Nowadays, many resources like training collections
and lexical databases have been successfully employed for text classification
tasks [?], but always in isolation. The current trend in the TC field is to pay
more attention to algorithms than to resources. We believe that the key to
improving text categorization is to increase the amount of information a system
makes use of, through the integration of several resources.

We have chosen the Information Retrieval vector space model for our ap- 
proach. Term weight vectors are computed for documents and categories em- 
ploying the lexical database WordNet and the training subset of the test 
collection Reuters-22173. We calculate the weight vectors for: 

• a direct approach,

• a WordNet-based approach,

• a training collection approach,

• and finally, a technique for integrating WordNet and a training collection.

Later, we compare document-category similarity by means of a cosine-based
function. We have conducted a series of experiments on the test subset of
Reuters-22173, which yield two conclusions. First, the integrated approach
performs better than any of the others, confirming the hypothesis that the more
informed a text classification system is, the better it performs. Secondly, the
lexical database technique can rival the training approach, avoiding the costly
construction of training collections for every domain and classification task.

2 Task Description 

Given a set of documents and a set of categories, the goal of a categorization
system is to decide, for each document and each category, whether the document
belongs to the category. The system makes use of the information contained in a
document to compute a degree of membership of the document in each category.
Categories are usually subject labels like ART or MILITARY, but other categories
like text genres are also interesting [?]. Documents can be news stories, e-mail
messages, reports, and so forth.

The most widely used resource for TC is the training collection. A training
collection is a set of manually classified documents that allows the system to
infer clues on how to classify new, unseen documents. There are currently
several TC test collections, from which a training subset and a test subset can
be obtained. For instance, the huge TREC collection [?], OHSUMED [?] and
Reuters-22173 [?] have been collected for this task. We have selected Reuters
because it has been used in other work, facilitating the comparison of results.

Lexical databases have rarely been employed in TC, but several approaches
have demonstrated their usefulness for term classification operations like word
sense disambiguation [6], [?]. A lexical database is a reference system that
accumulates information on the lexical items of one or several languages. In
this view, machine readable dictionaries can also be regarded as primitive
lexical databases. Current lexical databases include WordNet [?], EDR [?] and
Roget's Thesaurus. WordNet's large coverage and frequent use have led us to
choose it for our experiments.

We organize our work according to the kind and number of resources involved.
First, we have tested a direct approach in which the only terms used in the
representation are the categories themselves. Secondly, WordNet by itself has
been used to increase the number of terms and thus the amount of predictive
information. Thirdly, we have made use of the training subset of Reuters to
obtain the category representatives. Finally, we have employed both WordNet
and Reuters to get a better representation of undertrained categories.

3 Integrating Resources in the Vector Space 
Model 

The Vector Space Model (VSM) [10] is a very suitable environment for expressing
our approaches to TC: it is supported by extensive experience in text retrieval
[?]; it allows the seamless integration of multiple knowledge sources for
text classification; and it makes it easy to identify the role of every knowledge
source involved in the classification operation. In the next sections we present a
straightforward adaptation of the VSM for TC, and the way we use the chosen
resources for calculating several model elements.

3.1 Vector Space Model for Text Categorization 

The core of the VSM for Information Retrieval (IR) is the representation of
natural language expressions as term weight vectors. Each weight measures the
importance of a term in a natural language expression, which can be a document
or a query. Semantic closeness between documents and queries is computed as the
cosine of the angle between document and query vectors.

Exploiting an obvious analogy between queries and categories, the latter
can be represented by term weight vectors. Then, a category can be assigned
to a document when the cosine similarity between them exceeds a certain
threshold, or when the category is highly ranked. In more detail, given
three sets of N terms, M documents and L categories, the weight vector for
document j is $(wd_{1j}, wd_{2j}, \ldots, wd_{Nj})$ and the weight vector for
category k is $(wc_{1k}, wc_{2k}, \ldots, wc_{Nk})$. The similarity between
document j and category k is obtained with the formula:

$$ sim(d_j, c_k) = \frac{\sum_{i=1}^{N} wd_{ij} \cdot wc_{ik}}{\sqrt{\sum_{i=1}^{N} wd_{ij}^2 \cdot \sum_{i=1}^{N} wc_{ik}^2}} $$

Term weights for document vectors can be computed making use of well
known formulae based on term frequency. We use the following one from [?]:

$$ wd_{ij} = tf_{ij} \cdot \log_2 \frac{M}{df_i} $$

where $tf_{ij}$ is the frequency of term i in document j, and $df_i$ is the
number of documents in which term i occurs. Only the weights for category
vectors remain to be obtained. Next we show how to compute them depending on
the resource used.
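
As an illustration only, the document weighting and the cosine similarity of
this section can be sketched in Python as follows. The function names, the
sparse dictionary representation of vectors and the toy documents are
assumptions made for this sketch; it is not the implementation used in the
experiments.

```python
import math
from collections import Counter


def document_weights(docs):
    """Return one {term: weight} dict per document,
    with wd_ij = tf_ij * log2(M / df_i)."""
    M = len(docs)
    # df[i]: number of documents in which term i occurs
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log2(M / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_u * norm_v)


# Toy documents, already tokenized (hypothetical data).
docs = [
    ["barley", "export", "export"],
    ["gold", "export"],
    ["barley", "crop"],
    ["zinc", "gold"],
]
weights = document_weights(docs)
category = {"barley": 1.0}  # a category vector with a single term
similarity = cosine(weights[0], category)
```

A category would then be assigned to the first document whenever `similarity`
reaches the chosen threshold.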

3.2 Direct Approach 

This approach to TC makes no use of any resource apart from the documents to
be classified. It tests the intuition that the name of a content-based category
is a good predictor of the occurrence of that category. For instance, the
occurrence of the word "barley" in a document suggests that the document should
be classified in the BARLEY¹ category. We have taken exactly the category names,
although classification in more general categories like STRATEGIC-METAL should
rather rely on the occurrence of more specific words like "gold" or "zinc."

In this approach, the terms used for the representation are just the categories
themselves. The weight of term i in the vector for category k is 1 if i = k and
0 otherwise. Multiword categories imply the use of multiword terms. For
example, the expression "balance of payments" is considered as one term. When
categories consist of several synonyms (like IRON-STEEL), all of them are used
in the representation. Since the number of categories in Reuters is 135, and two
of them are composite, this approach produces 137-component vectors.
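
A minimal sketch of this representation, under stated assumptions, is given
below. The three-category excerpt, the helper names and the regular expression
matching of multiword terms are hypothetical choices made for illustration.

```python
import re

# Hypothetical excerpt of the category set: each category maps to the
# term(s) that name it. A composite category contributes several terms.
CATEGORY_TERMS = {
    "BARLEY": ["barley"],
    "BOP": ["balance of payments"],
    "IRON-STEEL": ["iron", "steel"],
}


def direct_category_vectors(category_terms):
    """Weight 1 for the terms naming the category, 0 (absent) otherwise."""
    return {cat: {term: 1.0 for term in terms}
            for cat, terms in category_terms.items()}


def count_terms(text, vocabulary):
    """Count occurrences of each term, treating a multiword term as one term."""
    lowered = text.lower()
    counts = {}
    for term in vocabulary:
        pattern = r"\b" + re.escape(term) + r"\b"
        n = len(re.findall(pattern, lowered))
        if n:
            counts[term] = n
    return counts


vectors = direct_category_vectors(CATEGORY_TERMS)
vocabulary = sorted({t for terms in CATEGORY_TERMS.values() for t in terms})
counts = count_terms("ITALIAN BALANCE OF PAYMENTS IN DEFICIT IN MAY", vocabulary)
```

In this excerpt, three categories with one composite name yield a vocabulary of
four terms, in the same way that 135 categories with two composite names yield
137 components.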

3.3 WordNet-based Approach

Lexical databases contain many kinds of information (concepts; synonymy and
other lexical relations; hyponymy and other conceptual relations; etc.). For
instance, WordNet represents concepts as synonym sets, or synsets. We have
selected this synonymy information, performing a "category expansion" similar
to query expansion in IR. For any category, the synset it belongs to is selected,
and any other term belonging to it is added to the representation. This technique
increases the amount of evidence used to predict category occurrence.

¹ All the following examples are taken from the Reuters category set and involve
words that actually occur in the documents.

Unfortunately, the disambiguation of categories with respect to WordNet 
concepts is required. We have performed this task manually, because the small 
number of categories in the test collection made it affordable. We are currently 
designing algorithms for automating this operation. 

After locating categories in WordNet, a term set containing all the cate- 
gory's synonyms has been built. For the 135 categories used in this study, we 
have produced 368 terms. Although some meaningless terms occur and could 
be deleted, we have developed no automatic criteria for this at the moment. 

Let us look at one example. The FUEL category has led us to add the terms
"combustible" and "combustible material," since they belong to the same synset
in WordNet. In general, the weight in the term weight vector for category k is
1 for every synonym of the category and 0 for any other term.
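
The category expansion can be sketched as follows. To keep the example
self-contained, the synsets are written by hand as a stand-in for WordNet
lookups after manual disambiguation; the FUEL entry follows the example above,
while the other entries and all names in the code are assumptions and may not
list every member of the real synsets.

```python
# Hypothetical stand-in for WordNet: each manually disambiguated category
# maps to the members of the synset it belongs to.
SYNSETS = {
    "FUEL": ["fuel", "combustible", "combustible material"],
    "CRUDE": ["crude", "petroleum"],        # partial membership, for illustration
    "GROUNDNUT": ["groundnut", "peanut"],   # partial membership, for illustration
}


def expand_categories(synsets):
    """Binary category vectors: 1 for every synonym, 0 (absent) otherwise."""
    return {cat: {term: 1.0 for term in members}
            for cat, members in synsets.items()}


def term_set(vectors):
    """All terms used by at least one category vector."""
    return sorted({term for vec in vectors.values() for term in vec})


wordnet_vectors = expand_categories(SYNSETS)
wordnet_terms = term_set(wordnet_vectors)
```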

3.4 Training Collection Approach 

The key assumption when using a training collection is that a term occurring
often within a category and rarely within others is a good predictor for that
category. A set of predictors is typically computed from term-to-category
co-occurrence statistics, as a training step. The computation depends on the
approach and algorithm selected. As Lewis has done before [?], we have replicated
in the VSM early Bayesian experiments that had reported good results.

Terms are selected according to the number of times they occur within
categories. Those terms which co-occur with at least 1% and at most 10% of
the categories are taken. Among them, the 286 with the highest document
frequency are selected. We work the weights out in the same way as in
document vectors:

$$ wc_{ik} = tf_{ik} \cdot \log_2 \frac{L}{cf_i} $$

where $tf_{ik}$ is the number of times that term i occurs within documents
assigned to category k, and $cf_i$ is the number of categories within which
term i occurs. For example, after selection and weighting, the high-frequency
term "export" shows its largest weight for category TRADE, but it also shows
large weights for GRAIN or WHEAT, and small weights for BELGIAN-FRANC and
WOOL. A less frequent term typically provides evidence for a smaller number
of categories. For example, "private" has a large weight only for ACQUISITION,
and medium weights for EARNINGS and TRADE.
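
The training step can be sketched as below. This is an illustrative reading of
the description rather than the code used in the experiments: the function
names and toy data are hypothetical, and we assume that the 1% and 10% bounds
are fractions of the number of categories L with which a term co-occurs. The
toy call uses wider bounds because it has only four categories.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_counts(training):
    """training: list of (tokens, categories) pairs."""
    tf = defaultdict(Counter)  # tf[k][i]: occurrences of term i in docs of category k
    df = Counter()             # df[i]: number of training documents containing term i
    for tokens, categories in training:
        df.update(set(tokens))
        for cat in categories:
            tf[cat].update(tokens)
    # cf[i]: number of categories within which term i occurs
    cf = Counter(term for counts in tf.values() for term in counts)
    return tf, df, cf


def select_terms(df, cf, num_categories, k=286, low=0.01, high=0.10):
    """Keep terms co-occurring with a bounded share of the categories,
    then take the k terms with the highest document frequency."""
    eligible = [t for t, n in cf.items()
                if low * num_categories <= n <= high * num_categories]
    return sorted(eligible, key=lambda t: (-df[t], t))[:k]


def category_weights(tf, cf, num_categories, terms):
    """wc_ik = tf_ik * log2(L / cf_i), restricted to the selected terms."""
    return {
        cat: {t: counts[t] * math.log2(num_categories / cf[t])
              for t in terms if counts[t] > 0}
        for cat, counts in tf.items()
    }


# Toy training subset (hypothetical data); L = 4 categories in total.
training = [
    (["export", "wheat", "export"], ["TRADE", "GRAIN"]),
    (["export", "tariff"], ["TRADE"]),
    (["wheat", "crop"], ["GRAIN"]),
    (["wool", "export"], ["WOOL"]),
]
L = 4
tf, df, cf = cooccurrence_counts(training)
terms = select_terms(df, cf, L, k=2, low=0.25, high=0.75)
weights = category_weights(tf, cf, L, terms)
```

With this toy data the term "export" receives its largest weight for TRADE and
smaller weights for GRAIN and WOOL, which mirrors the behaviour described above.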

3.5 Integrating WordNet and a Training Collection

Several ways of integrating WordNet and Reuters have occurred to us. A
sensible one is to use concepts instead of terms as representatives. However,
although the idea is promising, Voorhees reported no improvements with it
[12]. On the other hand, we have realized that the shortcomings in training
can be corrected by using WordNet to provide better prediction of
low-frequency categories.

In general, we have linked WordNet weight vectors to training weight
vectors. First we have removed those WordNet terms not occurring in the
training collection. Then we have normalized the WordNet vectors and the
training vectors separately, so that the weights of each kind add up to the
same total within each category. In this way we have smoothed the training
weights (much larger than the WordNet ones), giving equal influence to each
kind of term weight. This technique results in 461-term weight vectors, with
185 terms coming from WordNet and 286 from training. Weights for terms
occurring in both sets have been summed. Examples of terms coming from
training are "import" or "government," with high weights for highly frequent
categories, like ACQ (acquisition). Examples of terms coming from WordNet
are "petroleum" or "peanut," with weights only for the corresponding
categories, CRUDE and GROUNDNUT respectively.
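
The integration steps can be sketched as follows. The sketch assumes that each
kind of vector is normalized to sum to 1 within a category, which is one way of
giving both kinds equal influence; the function names, the toy weights and the
term "rock oil" are hypothetical.

```python
def normalize(vector):
    """Scale the weights so that they add up to 1 (assumed target total)."""
    total = sum(vector.values())
    return {t: w / total for t, w in vector.items()} if total else {}


def integrate(wordnet_vectors, training_vectors, training_vocabulary):
    """Combine both kinds of category vectors, summing shared terms."""
    combined = {}
    for cat in set(wordnet_vectors) | set(training_vectors):
        # 1. Drop WordNet terms that never occur in the training collection.
        wn = {t: w for t, w in wordnet_vectors.get(cat, {}).items()
              if t in training_vocabulary}
        # 2. Normalize each kind of vector separately.
        wn = normalize(wn)
        merged = normalize(training_vectors.get(cat, {}))
        # 3. Sum the weights of terms occurring in both sets.
        for term, w in wn.items():
            merged[term] = merged.get(term, 0.0) + w
        combined[cat] = merged
    return combined


# Toy vectors (hypothetical values).
wordnet_vectors = {"CRUDE": {"crude": 1.0, "petroleum": 1.0, "rock oil": 1.0}}
training_vectors = {
    "CRUDE": {"petroleum": 6.0, "export": 2.0},
    "ACQ": {"government": 3.0, "import": 1.0},
}
training_vocabulary = {"crude", "petroleum", "export", "government", "import"}
combined = integrate(wordnet_vectors, training_vectors, training_vocabulary)
```

In the toy result, "petroleum" accumulates evidence from both resources for
CRUDE, while ACQ, which has no WordNet terms here, keeps only its normalized
training weights.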

We can clearly identify the role of each resource in this TC approach.
WordNet supplies information on the semantic relatedness of terms and
categories when training data is not available or reliable. It directly
contributes part of the terms used in the vector representation. On the other
hand, the training collection supplies terms for those categories that are
better trained. The problem of unavailability of training data is thus overcome
through the use of an external resource.

4 Evaluation 

Evaluation of TC and other text classification operations exhibits great
heterogeneity. Different metrics and test collections have been used in
different works. This results in a lack of comparability among the approaches,
forcing researchers to replicate experiments from others. Trying to minimize
this problem, we have chosen a set of widely used metrics and a frequently
used, freely available test collection for our work. The metrics are recall and
precision, and the test collection is, as introduced before, Reuters-22173.
Before stepping into the actual results, we take a closer look at these
elements.

4.1 Evaluation metrics 

The VSM promotes recall and precision based evaluation, but there are several
ways of calculating or even defining them. We focus on recall, the discussion
being analogous for precision. First, the definition can be given with regard
to categories or documents [13]. Second, the computation can be done by
macro-averaging or micro-averaging [?].



• Recall can be defined as the number of documents correctly assigned to
a category over the number of documents that should be assigned to the
category. But a document-oriented definition is also possible: the number
of categories correctly assigned to a document over the number of correct
categories that should be assigned to the document. The latter definition
is more coherent with the task, but the former makes it possible to identify
the most problematic categories.

• Macro-averaging consists of computing recall and precision for every item
(document or category) in either of the previous ways, and averaging
afterwards. Micro-averaging consists of adding up all the numbers of correctly
assigned items, items assigned, and items to be assigned, and calculating only
one value of recall and precision. When micro-averaging, no distinction about
document or category orientation can be made. Macro-averaging assigns equal
weight to every category, while micro-averaging is influenced by the most
frequent categories.
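
The difference between both averages can be illustrated with a short sketch
using the category-oriented definition. The counts are invented for
illustration, and scoring an undefined ratio as 0 is an assumed convention.

```python
def recall_precision(correct, assigned, expected):
    """Recall and precision from raw counts; undefined ratios count as 0."""
    recall = correct / expected if expected else 0.0
    precision = correct / assigned if assigned else 0.0
    return recall, precision


def macro_average(counts):
    """Average the per-category recall and precision values."""
    pairs = [recall_precision(*c) for c in counts.values()]
    n = len(pairs)
    return sum(r for r, _ in pairs) / n, sum(p for _, p in pairs) / n


def micro_average(counts):
    """Pool the raw counts over all categories, then compute one value each."""
    correct = sum(c[0] for c in counts.values())
    assigned = sum(c[1] for c in counts.values())
    expected = sum(c[2] for c in counts.values())
    return recall_precision(correct, assigned, expected)


# Hypothetical per-category counts: (correct, assigned, expected).
counts = {"EARNINGS": (90, 100, 100), "WOOL": (1, 4, 2)}
macro = macro_average(counts)
micro = micro_average(counts)
```

Here the frequent category dominates the micro-averaged values, whereas the
rare category pulls the macro-averaged values down.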

Evaluation finally depends on the category assignment strategy: probability
thresholding, k-per-doc assignment, etc. Strategies define the way to produce
recall/precision tables. For instance, if similarities are normalized to the
[0, 1] interval, eleven levels of probability threshold can be set to 0.0, 0.1,
and so on up to 1.0. When the system performs k-per-doc assignment, the value
of k ranges from 1 to a reasonable maximum.

We must assign an unknown number of categories to each document in 
Reuters. So, the probability thresholding approach seems the most sensible 
one. We have then computed recall and precision for eleven levels of threshold, 
both macro and micro-averaging. When macro-averaging, we have used the 
category-oriented definition of recall and precision. After that, we have calcu- 
lated averages of those eleven values in order to get single figures for comparison. 
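
The thresholding procedure can be sketched as follows for the micro-averaged
case. The similarity scores and correct assignments are invented, and two
conventions are assumptions of this sketch: a category is assigned when its
similarity is greater than or equal to the threshold, and precision counts as 0
when nothing is assigned.

```python
def threshold_sweep(similarities, truth, levels=11):
    """similarities: {(doc, category): score in [0, 1]};
    truth: set of correct (doc, category) pairs.
    Returns one (threshold, recall, precision) triple per level."""
    results = []
    for step in range(levels):
        threshold = step / (levels - 1)
        assigned = {pair for pair, score in similarities.items()
                    if score >= threshold}
        correct = len(assigned & truth)
        recall = correct / len(truth) if truth else 0.0
        precision = correct / len(assigned) if assigned else 0.0
        results.append((threshold, recall, precision))
    return results


def averages(results):
    """Single recall and precision figures averaged over all levels."""
    n = len(results)
    return (sum(r for _, r, _ in results) / n,
            sum(p for _, _, p in results) / n)


# Hypothetical normalized similarities and correct assignments.
similarities = {
    ("d1", "TRADE"): 0.9,
    ("d1", "BOP"): 0.55,
    ("d2", "TRADE"): 0.2,
    ("d2", "WOOL"): 0.05,
}
truth = {("d1", "TRADE"), ("d1", "BOP"), ("d2", "WOOL")}
results = threshold_sweep(similarities, truth)
avg_recall, avg_precision = averages(results)
```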

4.2 The Test Collection 

The Reuters-22173 collection consists of 22,173 newswire articles from Reuters
collected during 1987. Documents in Reuters deal with financial topics, and were
classified in several sets of financial categories by personnel from Reuters Ltd.
and Carnegie Group Inc. Documents vary in length and in the number of categories
assigned, from 1 line to more than 50, and from no categories to more than 8.
There are five sets of categories: TOPICS, ORGANIZATIONS, EXCHANGES,
PLACES, and PEOPLE. As others before, we have selected the 135 TOPICS
for our experiments. An example of a news article classified in BOP (balance of
payments) and TRADE is shown in Figure 1. Some spurious formatting has been
removed from it.

PATTERN-ID 6505 TRAINING-SET
18-JUN-1987 11:44:27.20
TOPICS: bop trade END-TOPICS
PLACES: italy END-PLACES
PEOPLE: END-PEOPLE
ORGS: END-ORGS
EXCHANGES: END-EXCHANGES
COMPANIES: END-COMPANIES
ITALIAN BALANCE OF PAYMENTS IN DEFICIT IN MAY

ROME, June 18 - Italy's overall balance of payments showed
a deficit of 3,211 billion lire in May compared with a surplus
of 2,040 billion in April, provisional Bank of Italy figures
show.

The May deficit compares with a surplus of 1,555 billion
lire in the corresponding month of 1986.

For the first five months of 1987, the overall balance of
payments showed a surplus of 299 billion lire against a deficit
of 2,854 billion in the corresponding 1986 period.
REUTER

Figure 1: Document number 6505 from Reuters.

When a test collection is provided, it is customary to divide it into a training
subset and a test subset. Several partitions have been suggested for Reuters [?],
among which we have opted for the most general and difficult one. The first
21,450 news stories are used for training, and the last 723 are kept for testing.
We summarize significant differences between the test and training sets in
Figure 2. These differences can bring noise into categorization, because
training relies on the similarity between training and test documents.
Nevertheless, this 21,450/723 partition has been used before [?, 14] and
involves the general case of documents with no categories assigned.

We have worked with raw data provided in the Reuters distribution. Con- 
trol characters, numbers and several separators like '/' have been removed, and 
categories different from the TOPICS set have been ignored. For disambiguat- 
ing categories with respect to WordNet senses, we first had to acquire their 
meaning, not always self-evident. This task has been performed by direct ex- 
amination of training documents. 

Subcollection                          Training       Test      Total

Docs.                 Number             21,450        723     22,173

Words                 Occurrences     2,851,455    140,922  2,992,377
                      Doc. average          127        195        134

Docs. with 1+ Topics  Number             11,098        566     11,664
                      Percentage             52         78         53

Topics                Occurrences        13,756        896     14,652
                      Doc. average         0.64       1.24       0.66

Figure 2: Reuters-22173 document collection statistics.



                     Macro-averaging          Micro-averaging
Threshold strategy   Recall     Precision     Recall     Precision

Direct               0.239302   0.242661      0.205849   0.235775
WordNet              0.324899   0.306445      0.260762   0.298363
Training             0.325586   0.188701      0.365988   0.275731
Integrated           0.373365   0.220186      0.418652   0.296423

Figure 3: Overall results from our experiments.



4.3 Results and Interpretation 

The results of our first series of experiments are summarized in the table in
Figure 3. This table shows recall and precision averages calculated by both
macro and micro-averaging for a threshold-based assignment strategy. Values for
the integrated approach show some general advantage over the WordNet and
training approaches, but the results are not decisive. Training results are
comparable with those from Lewis, and the WordNet approach is roughly
equivalent to the training one.

On the one hand, the integrated approach shows better performance than the
WordNet one in general, although a precision problem is detected when
macro-averaging. The influence of low-precision training has produced this
effect. We are planning to strengthen the influence of WordNet to overcome this
problem. On the other hand, the integrated approach shows better general
performance than the training approach.

As expected, WordNet and training both beat the direct approach. When
comparing the WordNet and training approaches, we observe that the former
produces better results with categories of low frequency, while the latter
performs better on highly frequent categories. However, both exhibit the same
overall behaviour. The differences between categories show up because
micro-averaging is influenced by highly frequent elements, while
macro-averaging depends on the results of many elements of low frequency.



5 Related Work 

Text categorization has emerged as a very active field of research in recent
years. Many studies have been conducted to test the accuracy of training
methods, although much less work has been done on lexical database methods.
However, lexical databases, and especially WordNet, have often been used for
other text classification tasks, like word sense disambiguation.

Many different algorithms making use of a training collection have been
used for TC, including k-nearest-neighbor algorithms [?], Bayesian classifiers
[?], learning algorithms based on relevance feedback [16] or on decision trees
[17], and neural networks [18]. Apart from [?], the closest approach to ours is
the one from Larkey and Croft [?], who combine k-nearest-neighbor, Bayesian
independent and relevance feedback classifiers, showing improvements over the
separate approaches. Although they do not make use of several resources, their
approach tends to increase the information available to the system, in the spirit
of our hypothesis.

To our knowledge, lexical databases have been used only once in TC. Hearst
[19] adapted a disambiguation algorithm by Yarowsky using WordNet to
recognize category occurrences. Categories are made of WordNet terms, which
is not the general case of standard or user-defined categories. It is a hard task
to adapt WordNet subsets to pre-existing categories, especially when they are
domain dependent. Hearst's approach shows promising results, confirmed by the
fact that our WordNet-based approach performs at least as well as a simple
training approach.

Lexical databases have been employed recently in word sense disambiguation.
For example, Agirre and Rigau [?] make use of a semantic distance that
takes into account structural factors in WordNet to achieve good results
for this task. Additionally, Resnik [?] combines the use of WordNet and a
text collection to define a distance for disambiguating noun groupings.
Although the text collection is not a training collection (in the sense of a
collection of manually labelled texts for a pre-defined text processing task),
his approach can be regarded as the most similar to ours in the disambiguation
task. Finally, Ng and Lee [20] make use of several sources of information inside
a training collection (neighborhood, part of speech, morphological form, etc.) to
get good results in disambiguating unrestricted text.

We can see, then, that combining resources in TC is a new and promising
approach supported by previous research in this and other text classification
operations. With more information extracted from WordNet and better training
algorithms, automatic TC integrating several resources could compete with
manual indexing in quality, and beat it in cost and efficiency.

6 Conclusions and Future Work 

In this paper, we have presented a multiple resource approach for TC. This
approach integrates the use of a lexical database and a training collection in a
vector space model for TC. The technique is based on improving the construction
of the representation language through the use of the lexical database, which
overcomes training deficiencies. We have tested our approach against training
algorithms and lexical database algorithms, reporting better results than both of
these techniques. We have also observed that a lexical database algorithm
can rival training algorithms in real world situations.

Two main lines of work remain open: first, we have to conduct a new series of
experiments to compare the lexical database and the combined approaches with
other, more sophisticated training approaches; second, we will extend the
multiple resource technique to other text classification tasks, like text
routing or relevance feedback in text retrieval.